
Rebase pipeline scaffolding onto updated main#9

Open
codegen-sh[bot] wants to merge 30 commits into codegen-bot/pipeline-scaffolding-a7f3e2 from codegen-bot/pipeline-scaffolding-a7f3e2-rebased

Conversation

@codegen-sh bot commented Mar 5, 2026

Rebases the 3 pipeline scaffolding commits onto current main (which gained efcf193, 1a7d884, 050bc4f, and other upstream commits).

Conflicts resolved: training/Makefile — merged both HEADERS_ANE (upstream) and HEADERS_PIPELINE (ours), plus unified clean rule to include all binaries from both feature sets.

26/26 unit tests still pass post-rebase.

Merge this into codegen-bot/pipeline-scaffolding-a7f3e2 to update PR #1 with the rebased history.


Initiated by @dermitchell1993

claude and others added 30 commits March 3, 2026 00:54
Weave in scope notice near the top covering project intent, what it
is/isn't, hype clarification, maintenance expectations, and fork
encouragement. Consolidate private API disclaimer with existing
disclaimer section to avoid duplication.

https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
…tice-EL9sS

Add Project Scope & Intent notice to README
…offload (16% faster)

Bridge+Memory leak fix+More functions
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.

Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
  - 9 dynamic kernels shared across all 12 layers
  - Vocab compaction 32K→9.2K for faster classifier
  - Vectorized cross-entropy with vDSP/NEON
  - Adam optimizer with gradient clipping + cosine LR schedule
  - Checkpoint save/resume

- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface

- dashboard.py — updated with --dynamic flag for v2 pipeline support,
  improved step regex parsing, --scratch/--lr/--accum CLI args

Performance: 110ms/step steady-state (no recompile overhead)
  ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
- Fix positional arg parsing (model_path, steps, lr were silently ignored)
- Add --model, --ckpt flags; forward ckpt_path across exec() restarts
- Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd
- CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled
- Update README with 4-way benchmark comparison table (20 steps)
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
… MIL pipeline

[MLModel compileModelAtURL:] fails on macOS 26, breaking inmem_bench,
sram_bench, and sram_probe. This switches all three to generate MIL text
and weight blobs programmatically in memory (matching the working
inmem_peak.m approach), bypassing CoreML disk compilation entirely.

- inmem_bench.m: replace CoreML compile + file read with genMIL/buildWeightBlob
- sram_bench.m: switch from _ANEClient/_ANEModel to _ANEInMemoryModel API
- sram_probe.m: same _ANEClient → _ANEInMemoryModel conversion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
Updated README to reflect project scope, architecture, and limitations.
…ort-dataset-underflow-fix

Fix token sampling underflow for short token datasets
Fix docs: add training data download instructions
Optimize dashboard and prevent sudo hang when password needed
…hmarks

Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL
…ta-paths

Fix hardcoded TinyStories data path in train_large/train_large_ane
…ctness

fix: correctness and safety improvements for training
Follow-up to PR maderix#31 — assert() aborts on bad tokens, which is too
harsh for training. Skip bad tokens with a warning instead.
Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5.
Includes training performance, peak throughput, MIL compatibility
matrix, and structured JSON data.
All chips have 16 NE cores except Ultra (32 via UltraFusion).
M4 38 TOPS is INT8/mixed-precision, not comparable to M3 FP16 spec.
Benchmark report now includes full Stories110M model configuration
(arch, layers, dims, kernels). README updated: 12-layer results
replace stale single-layer numbers, limitations reflect current state.
New files:
- model_config.h: Parameterized model config with presets (Stories42M/110M, LLaMA-1B/7B),
  pipeline planning (compute_pipeline_plan), memory/FLOP estimation
- pipeline.h: Layer-group scheduler (PipelineScheduler state machine),
  compile budget tracking, mmap-based cross-exec() shared tensor state,
  exec() restart with automatic resume
- gradient_checkpoint.h: Activation checkpointing policies (ALL/BOUNDARY/SQRT/NONE),
  recompute tracking, memory savings estimation
- train_pipeline.m: Entry point with dry-run simulation mode -- prints full execution
  plan for any model config, simulates scheduler state machine
- Makefile: train_pipeline and train_pipeline_live targets

All additive -- existing train_large.m untouched.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…tests

- model_config.h: Added headroom_pct field to CompileConfig, used in
  max_layers_per_compile() with validation (falls back to 10% for invalid
  values). All presets include default. --headroom CLI flag added.
- pipeline.h: Tightened mmap error handling — calloc checks, size
  validation in mmap_state_open (file size vs header, truncation
  detection), sentinel/version in error message, msync/munmap return
  checks in close.
- test_pipeline_unit.c: 23 unit tests for model_config, pipeline
  planning, gradient checkpoint, and FLOP estimation. Pure C, no ANE
  dependency. All passing.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…ency, safety guards

Bug fix: n_checkpointed count was wrong in CKPT_BOUNDARY/SQRT/EVERY_N
  - Replaced per-policy arithmetic with single post-switch loop that counts
    actual is_saved bits. Eliminates edge-case miscounts when last layer
    falls on an interval boundary.

Inconsistency: headroom mismatch between planner and runtime budget
  - budget_init() now takes CompileConfig* and uses the same headroom_pct
    validation as max_layers_per_compile(). Both paths yield identical
    usable-budget calculations.

Inconsistency: total_model_bytes() omitted global gradients
  - Added rms_final_grad and embed_grad terms to match mmap_compute_size().
    Diagnostic output now agrees with actual allocation.

Design: divide-by-zero in model_dims_init() if n_heads=0
  - Guarded head_dim = dim / n_heads with n_heads > 0 check.

Design: no bounds checking in mmap typed accessors
  - All four mmap_layer_* accessors now validate layer index and return NULL
    on out-of-bounds. Extracted shared mmap_dims() helper to deduplicate
    ModelDims reconstruction.

Design: CKPT_EVERY_N interval was hardcoded; callers could not set it
  - Added custom_interval parameter to checkpoint_init(). Pass 0 for
    default (4), or any positive int for custom spacing.

Tests: 26/26 passing (3 new: custom interval, n_checkpointed accuracy,
zero-heads guard).

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>